We predict the popularity of short messages called tweets created in the
micro-blogging site known as Twitter. We measure the popularity of a tweet
by the time-series path of its retweets, which is when people forward the tweet
to others. We develop a probabilistic model for the evolution of the retweets
using a Bayesian approach, and form predictions using only observations on
the retweet times and the local network or "graph" structure of the retweeters.
We obtain good step-ahead forecasts and predictions of the final total number
of retweets even when only a small fraction (i.e. less than one tenth) of
the retweet paths are observed. This translates to good predictions within a
few minutes of a tweet being posted and has potential implications for
understanding the spread of broader ideas, memes, or trends in social networks
and also revenue models for both individuals who "sell tweets" and for those
looking to monetize their reach.
1. Introduction. The rapid rise in the popularity of online social networks
has resulted in an explosion of user generated content. There is a wide variety in
the type of content: it can be a user comment, a photograph, a movie, or a link to
a news article. Typically, in these online social networks, users form connections
with other users, producing a social graph. For example, in the micro-blogging site
Twitter, these connections are known as followers and the resulting social graph is
known as the follower graph. When a user generates a piece of content, it becomes
visible to all of his or her followers in the social graph. The content spreads through
the social graph if these followers subsequently repost the content so their followers
can see it and potentially repost it further.

In this work we focus on the micro-blogging site Twitter which has over 500
million users as of July 2012 (Semiocast, 2012). The user-generated content in
Twitter is composed of short messages known as tweets containing up to 140
characters, which can also contain images or links to news articles or videos. Tweets
are spread through the Twitter follower graph by the act of retweeting, which is
when a user forwards a tweet to his or her followers.

Our goal in this work is to predict the popularity of a tweet by predicting the 
time path of retweets it receives. We aim to make these predictions very early on 
in the lifetime of the tweet, sometimes within minutes of it being posted. We use
a Bayesian model to describe the evolution of the retweets of a tweet. With this 
model we make predictions for the total number of retweets a tweet will receive 
using information from early retweet times, the retweets of other tweets, and the 
follower graph. 

There are many reasons to be interested in user retweeting behavior. First, there 
is the potential of understanding retweet behavior in and of itself. Tweets, and
more generally any user-generated content, attract the interest of other users for
a short amount of time, ranging from a few hours to a few days. During this short
amount of time, the tweet is potentially a source of a large number of impressions or 
views by other users. Therefore, one could imagine placing display advertisements 
within a tweet and paying the creator of the tweet on a per-impression basis, as is 
done with display ads on websites. However, websites are stable entities compared 
to tweets which are constantly generated and only popular for a short amount of 
time. A tweet will quickly "die" and no longer be viewed. Therefore, for a tweet 
to be a useful source of impressions, one must be able to estimate ahead of time 
how many potential impressions it will receive. With this predictive capability, one 
could determine with which tweet to place advertisements and compensate the user 
who generated it accordingly. 

Beyond a single tweet, understanding retweet behavior could lead to a better 
understanding of how broader ideas spread in Twitter and in other social networks. 
These ideas would consist of tweets from a large number of users on a similar topic. 
Understanding this type of information spreading would potentially allow one to 
predict which trends, memes, or ideas will become popular, how popular they will 
become, and how quickly they will become popular. Predictions of this sort have 
potential applications to marketing (new product adoption), politics (campaign ef- 
fectiveness) and national security (protests and civil unrest), to name a few. 

The remainder of the paper is organized as follows. In Section 1.1 we describe 
related work. In Section 2 we provide a description of the data utilized and an 
exploratory set of analyses of it that guides the proposed probabilistic model of 
Section 3. We present our posterior computations via Markov-chain Monte Carlo 
(MCMC) in Section 3.4. In Section 4 we present an analysis of our model's pre- 
dictive performance on our Twitter data. We discuss extensions to this research in 
Section 5. 

1.1. Previous Work. There has been much recent interest in the retweet pre- 
diction problem, albeit in terms of a slightly different type of prediction task. In 
particular, recent extant research (Bakshy et al., 2010, Zaman et al., 2010) tried to 
predict the existence of a retweet between a particular pair of users. While this is 
an important problem in graph formation or viral spreading across vertices, it is 
notably a different problem than addressed here due to the precision and pairwise
specificity required. 

Suh et al. (2010) used a generalized linear model to understand what features 
influenced the chance of a tweet being retweeted by anyone. Other work (Bandari, 
Asur and Huberman, 2012, Hong, Dan and Davison, 2011) built upon this and 
used a variety of algorithms to try to predict not the exact number of retweets, but 
rather a coarse interval for the number of retweets of a tweet. Similar techniques 
were used by Naveed et al. (2011) and Petrovic, Osborne and Lavrenko (2011) to 
predict the probability that a tweet receives any retweets, which by definition is 
nested within the problem we consider. 

In contrast to these previous works, we are trying to predict the entire time path, 
and hence the eventual number of retweets of a tweet. This is similar to Szabo and 
Huberman (2010) who use a linear model to predict the popularity of stories on 
Digg.com and videos on YouTube after 30 days by observing their popularity after 
one hour and one week, respectively. Our prediction goal is similar to this, but as 
we demonstrate in Section 4, our approach produces accurate predictions for the 
number of retweets using only minutes of observations, rather than hours or days. 
Given the Bayesian approach utilized here, accurate predictions are possible for a 
given tweet's retweet path even when there are no available data other than that of 
other retweet paths observed so far, especially if one utilizes covariates describing 
the tweets, retweets, and their authors (an area for future research). 

2. Data Overview. In this section, we describe the retweet data we obtained 
and present exploratory data analysis of some basic features. This analysis is useful 
in providing an understanding of the scales associated with the data (number of 
retweets of a typical tweet, time-scale over which a typical tweet is retweeted) and 
in guiding our more formal modeling choices. 

2.1. Data Description. We collected retweet data that cover a fairly wide ar- 
ray of topics and also have a wide range of retweet graph sizes. The topics include 
music, politics, and miscellaneous everyday events. Our dataset consists of 52 dif- 
ferent tweets which were selected through manual exploration of Twitter. We refer 
to these original tweets as root tweets. For each root tweet, we used the Twitter 
Search API (Twitter, 2012) to find all retweets. We used root tweets which were at 
least a week old to make sure that there were likely to be no more retweets occur- 
ring. The search API provided us with the retweet times and identity of the users 
who retweeted. Also, since the Search API could only return a maximum of 1,800
results, we did not look at root tweets with more than this many retweets. Based 
on previous empirical studies (Cha et al., 2010, Zhou et al., 2010), this maximum 
number of retweets will cover a large fraction of tweets in Twitter and does not rep- 
resent a significant limitation. Furthermore, the statistical models here generalize 
to larger retweet sequences.

From the text of the retweet, we are able to identify the person that the user 
retweeted (the username following the text "RT @"). For example, if user Alice 
posted the tweet "Hello" and user Bob retweeted this root tweet, it would appear as
"RT @Alice: Hello". We then used the Twitter API to find the number of followers
of the root user and each user who retweeted. The number of followers will act 
as a covariate in our predictive model. In particular, the number of followers for a 
given user represents both the potential retweet base for a given tweet and also a 
significant moderator of the speed and timing of retweets. 

We associate with each root tweet a directed retweet graph. We will utilize the
following notation for the different data associated with the retweet graph. We
denote the root tweet as $x$, which is tweeted by root user $v_0^x$. The retweet graph
associated with $x$ which we observe at time $t$ is denoted $G^x(t) = (V^x(t), E^x(t))$.
The vertex set $V^x(t)$ includes the root user (who tweets at $t = 0$) and all users
who retweet the root tweet before time $t$. A directed edge $(u, v) \in E^x(t)$ points
from user $u$ to user $v$ if $v$ retweets $u$ before $t$. We will denote the total number of
retweets in $G^x(t)$ by $m^x(t) = |V^x(t)| - 1$. We define the final number of retweets
of $x$ as $M^x = \lim_{t \to \infty} m^x(t)$, and it is the arrival of retweets and the
attained $M^x$ that we wish to predict.

The $j$th user to retweet $x$ is denoted $v_j^x$ for $j = 1, 2, 3, \ldots$ The time of the $j$th
retweet is denoted $T_j^x$, with $T_0^x = 0$ (the root tweet occurs at time 0). User $v_j^x$ has
$f_j^x$ Twitter followers and is $d_j^x$ "hops" from the root user $v_0^x$ in the retweet graph.
The parent of $v_j^x$ in the retweet graph is denoted $p_j^x$. To illustrate these definitions,
we show in Figure 1 an example of the retweet graph for a root tweet. Included
are pictures of the evolution of the retweet graph, a plot of the number of retweets
versus time, and a table showing the aforementioned summary data for several
users in the retweet graph. As we can see, this particular root tweet has almost all
of its retweets at depth 1 (1 hop from the source).
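
The notation above maps naturally onto a small data structure. The following is a minimal sketch, not the authors' code: the `Retweet` class, the helper names, and the toy users are all hypothetical, and times are assumed to be seconds since the root tweet.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Retweet:
    """One vertex v_j of a retweet graph (hypothetical container)."""
    user: str
    time: float              # T_j, seconds since the root tweet (root: 0.0)
    followers: int           # f_j
    parent: Optional[str]    # user name of the parent p_j, None for the root


def depths(vertices: List[Retweet]) -> Dict[str, int]:
    """Return d_j, the number of hops from the root, for every user."""
    by_user = {v.user: v for v in vertices}
    out: Dict[str, int] = {}
    for v in vertices:
        hops, cur = 0, v
        while cur.parent is not None:   # walk up the tree to the root
            cur = by_user[cur.parent]
            hops += 1
        out[v.user] = hops
    return out


def retweet_count(vertices: List[Retweet], t: float) -> int:
    """m(t) = |V(t)| - 1: retweets made before time t (root excluded)."""
    return sum(1 for v in vertices if v.parent is not None and v.time < t)


# Toy graph (made-up data): alice is the root, bob and carol retweet her,
# and dave retweets bob.
graph = [
    Retweet("alice", 0.0, 5000, None),
    Retweet("bob", 60.0, 300, "alice"),
    Retweet("carol", 95.0, 120, "alice"),
    Retweet("dave", 400.0, 80, "bob"),
]
print(depths(graph))
print(retweet_count(graph, 100.0))
```

Storing only the parent of each vertex is enough here because a retweet graph is a tree rooted at the root user.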

2.2. Size, Lifetime and Depth of Retweet Graphs. We first look at the size and 
lifetime of the 52 retweet graphs. The root tweets we collected had between 21 and 
1260 retweets. The time for the final retweet to occur ranged from a few hours to a 
few days as some of the final retweets had very large retweet times. A more stable 
measure of the lifetime of a root tweet is the time to reach 50% (the median) of its 
total retweet count. The median retweet times ranged from four minutes to three 
hours, with most being less than one hour. We plot the eventual number of retweets 
versus the median retweet times for the 52 root tweets in Figure 2. The correlation 
coefficient for the median retweet times and the eventual number of retweets is 
-0.12 (p-value = 0.49). One conclusion that can be drawn from this is that there
does not seem to be a strong dependence between the median retweet time and the
final size of the retweet graph.

FIG 1. Data for the root tweet "Cory Booker has never worked a day in his life. Not.
#corybookerstories" by root user pbsgwen. The table shows the relevant data for the retweet
graph for several users. The plot shows the number of retweets of the root tweet versus time.
Images of the retweet graph at different times are also shown.

FIG 2. Total number of retweets versus median retweet time for different root tweets.

We next explore the structure of the retweet graphs. In particular, we look at 
the number of vertices one hop and more than one hop from the root user. For the 
52 root tweets, there are 11,882 users that retweet who are one hop from the root
user and only 314 users more than one hop from the root user. Figure 3 shows 
the histogram of vertices at different depths in all of the retweet graphs, along 
with a plot of the fraction of vertices more than one hop from the root user for 
each retweet graph. As can be seen, retweet graphs typically have most vertices 
at depth one, but occasionally they have some vertices at depth greater than one, 
suggesting that root tweets get retweeted much more often than the retweets get 
retweeted. This fact agrees with previous studies done on retweet graph structures 
(Goel, Watts and Goldstein, 2012, Kwak et al., 2010) and is key to our ability to 
predict $M^x$ early, even before potential retweets from those two hops or more are
taken into account. 

2.3. Reaction Times. Given, as before, that the $j$th retweet of the root tweet
occurs at time $T_j^x$ by user $v_j^x$, we define the reaction time $S_j^x = T_j^x - T_{p_j^x}^x$ as the
elapsed time between when the parent of $v_j^x$ (re)tweets and $v_j^x$ retweets. That is, $S_j^x$
is the time that it takes $v_j^x$ to react and retweet after the root tweet becomes visible
to $v_j^x$ via its parent's (re)tweet. Figure 4 provides a graphical explanation of the
reaction times in terms of retweet times.
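
As a concrete illustration of this definition, the sketch below computes each reaction time as the retweeter's time minus the parent's time. The function name, the dictionary layout, and the toy data are assumptions made for the example only.

```python
from typing import Dict, Optional, Tuple


def reaction_times(
    retweets: Dict[str, Tuple[float, Optional[str]]]
) -> Dict[str, float]:
    """Map each retweeter to S_j = T_j - T_{p_j}.

    `retweets` maps user -> (retweet time T_j, parent user); the root user
    has parent None and no reaction time.
    """
    out: Dict[str, float] = {}
    for user, (t_j, parent) in retweets.items():
        if parent is None:
            continue                      # the root tweet has no parent
        t_parent = retweets[parent][0]    # time of the parent's (re)tweet
        out[user] = t_j - t_parent
    return out


# Made-up example: dave retweets bob, so his clock starts at bob's retweet.
toy = {
    "alice": (0.0, None),
    "bob": (60.0, "alice"),
    "carol": (95.0, "alice"),
    "dave": (400.0, "bob"),
}
print(reaction_times(toy))
```

For users one hop from the root the reaction time equals the retweet time, since the root tweet occurs at time 0.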



To begin a more formal exploration of our data, we first consider a simple and
non-Bayesian model in which each $S_j^x$ is assumed to be an i.i.d. log-normal random
variable with parameters $\alpha^x$ and $\tau^x$: $\log(S_j^x) \sim \mathcal{N}(\alpha^x, (\tau^x)^2)$. We assume that the
parameters of the log-normal are different for each root tweet $x$, but the same for
each user within a given retweet graph. This assumption takes into account the fact
that there can be heterogeneity of these parameters which depends on the content
of the root tweet.

FIG 3. (left) Histogram of the fraction of users at different depths in all 52 retweet graphs.
(right) Histogram of the fraction of vertices of depth greater than one in the retweet graph
for each root tweet.

FIG 4. Description of reaction times for a retweet graph. The vertical position of a vertex
indicates when it retweeted, with time increasing as one goes down. The reaction time $S_j$ on
each edge is expressed in terms of the retweet times $T_j$ of the vertices.

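To assess the log-normal assumption, we calculate the maximum likelihood
(ML) estimate of $\alpha^x$ and $\tau^x$ for each root tweet. Given a set of reaction times
$S_j^x$ for $j = 1, 2, \ldots, M^x$, the ML estimates are straightforwardly given by

$\hat{\alpha}^x_{ML} = \frac{1}{M^x} \sum_{j=1}^{M^x} \log(S_j^x), \qquad (\hat{\tau}^x_{ML})^2 = \frac{1}{M^x} \sum_{j=1}^{M^x} \left( \log(S_j^x) - \hat{\alpha}^x_{ML} \right)^2.$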

In Figure 5 (top left) we show a scatter-plot of $\hat{\alpha}^x_{ML}$ and $\hat{\tau}^x_{ML}$ for different root
tweets $x$. The mean and standard deviation of $\hat{\alpha}^x_{ML}$ are 7.31 and 0.73, respectively.
The mean and standard deviation of $\hat{\tau}^x_{ML}$ are 2.31 and 0.31, respectively, and we
clearly see some heterogeneity over $x$. To assess fit, we show in Figure 5 the
empirical complementary cumulative distribution function (CCDF) of the reaction times
along with the CCDF of a log-normal distribution using the ML estimates for the
parameters for three root tweets representing the 2.5 (small size, top right), 50
(medium size, lower left), and 95 (large size, lower right) percentiles of retweet
graph size in our dataset. Qualitatively, the log-normal curves provide a reasonable
fit for the reaction times.
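
The fit check described above can be reproduced with a few lines of standard-library Python. This is a sketch under stated assumptions rather than the authors' procedure: the function names are hypothetical, the ML estimates are taken to be the mean and (biased) standard deviation of the log reaction times, and the sample below is made up.

```python
import math
from typing import List, Tuple


def lognormal_mle(s: List[float]) -> Tuple[float, float]:
    """ML estimates (alpha, tau) for log(S) ~ N(alpha, tau^2)."""
    logs = [math.log(v) for v in s]
    n = len(logs)
    alpha = sum(logs) / n
    tau = math.sqrt(sum((x - alpha) ** 2 for x in logs) / n)  # divides by n
    return alpha, tau


def lognormal_ccdf(s: float, alpha: float, tau: float) -> float:
    """P(S > s) under the fitted log-normal."""
    z = (math.log(s) - alpha) / (tau * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def empirical_ccdf(sample: List[float], s: float) -> float:
    """Fraction of observed reaction times strictly greater than s."""
    return sum(1 for v in sample if v > s) / len(sample)


# Made-up reaction times in seconds, for illustration only.
sample = [30.0, 75.0, 200.0, 640.0, 1500.0, 9000.0]
alpha_hat, tau_hat = lognormal_mle(sample)
for s in (60.0, 600.0, 6000.0):
    print(s, empirical_ccdf(sample, s), lognormal_ccdf(s, alpha_hat, tau_hat))
```

Plotting the two CCDF columns against `s` on log-log axes gives the kind of comparison shown in Figure 5.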

The observation of log-normally distributed reaction times has occurred in other 
application areas. For instance, Stouffer, Malmgren and Amaral (2006) observed 
that the time for people to respond to emails follows a log-normal distribution. 
Brown et al. (2005) observed that call durations in call centers follow a log-normal 
distribution. In the psychology literature there have been different models proposed 
to explain the origin of log-normal reaction times in different contexts (Ulrich and 
Miller, 1993, van Breukelen, 1995). However, these models do not apply directly 
to Twitter and it is interesting to see the same general empirical pattern replicated 
here. 

2.4. Retweet Graph Structure. In this section we provide an initial exploration
of the effects of the number of followers, $f_j^x$, and distance from the root, $d_j^x$, on
the probability of a user's tweet being retweeted. Once a user $v_j^x$ (re)tweets in the
retweet graph for a root tweet $x$, the (re)tweet appears in the Twitter feed
(timeline) of all of $v_j^x$'s followers. Some number of these followers will subsequently
retweet $v_j^x$. We denote this number by $M_j^x$, which is equal to the out-degree of
$v_j^x$ in the completed retweet graph once the root tweet has stopped spreading. We
assume that each of the $f_j^x$ followers of $v_j^x$ will independently retweet $v_j^x$ with
probability $0 < b_j^x < 1$. This gives $M_j^x$ a binomial distribution $\mathrm{Bi}(f_j^x, b_j^x)$. We
note that this assumption of conditional independence across followers is reasonable
because retweeters are unlikely to be connected to other retweeters, and hence
there is no "visibility" between the $f_j^x$ followers. For other networks there may be
generalizations needed.

FIG 5. The top left figure is a scatter-plot of ML estimates of $\alpha^x$ and $\tau^x$ for different root
tweets. The remaining figures are plots of the empirical reaction time complementary
cumulative distribution function (CCDF) (black circles) and the CCDF of log-normal
distributions using the ML parameter estimates (solid line) for three different root tweets
representing the 2.5 (top right), 50 (bottom left), and 95 (bottom right) percentiles of retweet
graph size in our dataset. For each root tweet, we show the root user for the tweet and the
total number of retweets it received.

We assume $b_j^x$ depends upon two pieces of information: the number of followers
$f_j^x$ of $v_j^x$ and the distance $d_j^x$ of $v_j^x$ from $v_0^x$ in the retweet graph. This makes
conceptual sense as these two variables represent the potential retweet base and the
"degree of closeness" of each vertex respectively. We model $\mathrm{logit}(b_j^x)$ as

(1) $\mathrm{logit}(b_j^x) = \beta_0 + \beta_f \log(f_j^x + 1) + \beta_d \log(d_j^x + 1) + \epsilon_j^x$

where $\epsilon_j^x \sim \mathcal{N}(0, \sigma_b^2)$. In a further exploratory analysis (formal model in Section
3), for each user $v_j^x$ we estimate $b_j^x$ as $\hat{b}_j^x = M_j^x / f_j^x$. We then perform a linear
regression of $\mathrm{logit}(\hat{b}_j^x)$ on $\log(f_j^x + 1)$ and $\log(d_j^x + 1)$ for all users in all root
tweets. For this exploratory analysis we only include users for which $M_j^x > 1$ so
that $\mathrm{logit}(\hat{b}_j^x)$ will be finite.

The ML estimates of the regression coefficients are $\hat{\beta}_0 = 1.99$, $\hat{\beta}_f = -0.79$, and
$\hat{\beta}_d = -4.31$, and the p-values of the corresponding t-statistics are all well below
0.001, indicating that each coefficient is highly significant. In Figure 6 we
plot $\mathrm{logit}(\hat{b}_j^x) - \hat{\beta}_0 - \hat{\beta}_d \log(d_j^x + 1)$ versus $f_j^x$ and $\mathrm{logit}(\hat{b}_j^x) - \hat{\beta}_0 - \hat{\beta}_f \log(f_j^x + 1)$
versus $d_j^x$ in order to show the isolated effect of each covariate.

The value for $\hat{\beta}_f$ is negative, which is expected given the way $\hat{b}_j^x$ is defined,
but the value is greater than -1, indicating that there is some non-trivial relation
between $M_j^x$ and $f_j^x$. Specifically, this result says that the average value of $M_j^x$
scales as $b_j^x f_j^x \sim (f_j^x)^c$ for some $0 < c < 1$. Therefore, the number of retweets
should grow with the number of followers a user has, but at a decreasing rate.
The value for $\hat{\beta}_d$ is also negative, indicating that after controlling for $f_j^x$, a retweet
is less likely the farther we get from the root user. Both of these findings are in
accordance with previous research on retweet graph structure (Goel, Watts and
Goldstein, 2012, Kwak et al., 2010) and provide face validity to our results.
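
The scaling exponent can be made explicit with a short calculation. The steps below are a sketch that relies on two approximations not stated in the text: the retweet probability is small, so that its logit is close to its logarithm, and the follower count is large, so that $f_j^x + 1 \approx f_j^x$. The noise term of equation (1) is ignored.

```latex
% Worked scaling argument under the approximations logit(b) ~ log(b), f + 1 ~ f.
\begin{align*}
\log \hat{b}_j^x \approx \operatorname{logit}(\hat{b}_j^x)
  &= \hat{\beta}_0 + \hat{\beta}_f \log(f_j^x + 1) + \hat{\beta}_d \log(d_j^x + 1), \\
\mathbb{E}[M_j^x] = b_j^x f_j^x
  &\approx e^{\hat{\beta}_0} \, (d_j^x + 1)^{\hat{\beta}_d} \, (f_j^x)^{1 + \hat{\beta}_f}
   \;\propto\; (f_j^x)^{c},
  \qquad c = 1 + \hat{\beta}_f = 1 - 0.79 = 0.21 .
\end{align*}
```

Under these approximations, doubling a user's follower count multiplies the expected number of retweeters by about $2^{0.21} \approx 1.16$, which is the sense in which growth occurs at a decreasing rate.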

3. Retweet Model. Our data analysis in Section 2 provides us with insights on 
the important properties of the dynamics of retweeting and the structure of retweet 
graphs. Based on these insights, we propose a Bayesian model for the evolution of 
the retweet graph of a root tweet. 

3.1. Log-normal Model for Reaction Times. From our exploratory analysis,
we saw that a log-normal distribution provided a reasonable fit for the reaction
times. There was some variation in $\alpha^x$ and $\tau^x$ across tweets. Therefore, we choose
the following model for the reaction times. For each root tweet $x$ we model $\log(S_j^x)$
as normal with a tweet specific mean $\alpha^x$ and standard deviation $\tau^x$. We place a
normal prior on $\alpha^x$ and an inverse-gamma prior on $(\tau^x)^2$, in accordance with standard
hierarchical Bayesian models (cf., Gelman and Hill, 2007). In particular,

(2) $\log(S_j^x) \mid \alpha^x, \tau^x \sim \mathcal{N}(\alpha^x, (\tau^x)^2)$
(3) $\alpha^x \mid \alpha, \sigma_\alpha \sim \mathcal{N}(\alpha, \sigma_\alpha^2)$
(4) $(\tau^x)^2 \sim \mathrm{IG}(a_\tau, b_\tau)$.

FIG 6. [...] visibility of the data.



To complete our hierarchical Bayesian specification and ameliorate issues with 
hyperparameter sensitivity, we use the following hyperpriors: 



(5) $\ldots$
(6) $\sigma_\alpha^2 \sim \mathrm{IG}(a_\alpha, b_\alpha)$
(7) $\log(a_\tau) \sim \mathcal{N}(\mu_a, \sigma_a^2)$
(8) $b_\tau \sim \mathrm{Gamma}(k_b, \ldots)$



and note that exact hyperparameter values, selected to be uninformative, are pro- 
vided in the Appendix. The graphical model for the reaction time component of the 
model is shown in the bottom portion of Figure 7 (see node $S_j^x$ and all associated
connections), and demonstrates the cross-tweet shrinkage that is allowed by our 
model. 

imsart-aap ver. 2009/12/15 file: retweetPrediction_vl5_arxiv.tex date: April 26, 2013
T. ZAMAN, E. B. FOX AND E. T. BRADLOW

FIG 7. Graphical model of the Bayesian log-normal-binomial model for the evolution of retweet
graphs. Hyper-priors are omitted for simplicity. The plates denote replication over tweets $x$ and
users $v_j^x$.

3.2. Binomial Model for Retweet Graph Structure. We saw initial evidence
that the retweet probabilities $b_j^x$ showed dependence on $f_j^x$ and $d_j^x$. Using this
insight, we propose the following model for the retweet graph structure:

(9) $M_j^x \mid f_j^x, b_j^x \sim \mathrm{Bi}(f_j^x, b_j^x)$

(10) $\mathrm{logit}(b_j^x) \mid \mu_j^x, \sigma_b \sim \mathcal{N}(\mu_j^x, \sigma_b^2)$

where we define

(11) $\mu_j^x = \beta_0 + \beta_f \log(f_j^x + 1) + \beta_d \log(d_j^x + 1)$.

This model allows for the possibility of the number of followers, $f_j^x$, and the depth
of the retweet from the root, $d_j^x$, to influence the number of eventual retweeters.
The influence of the covariates, as determined by $\beta_f$ and $\beta_d$, is shared across root
tweets $x$. As with the reaction time model, we put hyperpriors on these global
model parameters:



(12)-(15) $\ldots$, $\beta_d \sim \mathcal{N}(\mu_{\beta_d}, \sigma_{\beta_d})$, $\ldots$

where we give the specific (uninformative) hyperparameter values in the
Appendix. The combined model for reaction times and the graph structure is shown
in Figure 7.
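
To make the generative reading of this component concrete, the sketch below draws a retweet probability from the logit-normal layer and then a number of retweeters from the binomial layer. It is an illustration only: the function names are hypothetical, and the parameter values plugged in are the posterior means reported in Table 1, used here simply as plausible numbers.

```python
import math
import random


def retweet_probability(f: int, d: int, beta0: float, beta_f: float,
                        beta_d: float, sigma_b: float,
                        rng: random.Random) -> float:
    """Draw b with logit(b) ~ N(mu, sigma_b^2), mu linear in the log covariates."""
    mu = beta0 + beta_f * math.log(f + 1) + beta_d * math.log(d + 1)
    logit_b = rng.gauss(mu, sigma_b)
    return 1.0 / (1.0 + math.exp(-logit_b))   # inverse logit


def sample_out_degree(f: int, b: float, rng: random.Random) -> int:
    """Draw M ~ Bi(f, b): each of the f followers retweets independently."""
    return sum(1 for _ in range(f) if rng.random() < b)


rng = random.Random(0)
# Root user (depth 0) with 5,000 followers; coefficients from Table 1.
b_root = retweet_probability(5000, 0, -4.61, -0.28, -8.22, 1.69, rng)
print(b_root, sample_out_degree(5000, b_root, rng))
# A user one hop away with the same follower count is far less likely
# to be retweeted because of the large negative depth coefficient.
b_hop1 = retweet_probability(5000, 1, -4.61, -0.28, -8.22, 1.69, rng)
print(b_hop1, sample_out_degree(5000, b_hop1, rng))
```

The Bernoulli loop is used instead of a library binomial sampler only to keep the example dependency-free; it is slow for very large follower counts.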

3.3. Likelihood Function. We now derive the likelihood function for our retweet
model. We partition our dataset into two types of tweets, training tweets and
prediction tweets. The training tweets are fully observed retweet graphs. That is, we
observe all reaction times ($S_j^x$) along with the final degree ($M_j^x$) of each vertex
in the retweet graph. For the prediction tweets, we observe the retweet graph at a
time $t$ and therefore only observe a fraction of the reaction times and the current
degree of each vertex, which we denote by $m_j^x(t)$. We do not observe the $M_j^x$'s in a
prediction tweet$^1$ and therefore we treat these as missing data. Note that with this
notation we have that $\lim_{t \to \infty} m_j^x(t) = M_j^x$, i.e. $M_j^x$ is the value of $m_j^x(t)$ in the
limit when the tweet has stopped spreading.

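We first derive the likelihood for a training tweet. The observed data for a
training tweet are $\mathbf{S}^x = \bigcup_{j=1}^{M^x} S_j^x$ and $\mathbf{M}^x = \bigcup_{j=0}^{M^x} M_j^x$. Recall that in our model
$\log(S_j^x) \sim \mathcal{N}(\alpha^x, (\tau^x)^2)$. Therefore, if we define $\mathbf{b}^x = \bigcup_{j=0}^{M^x} b_j^x$, the conditional
distribution of the observations is given by

(16) $P(\mathbf{S}^x, \mathbf{M}^x \mid \alpha^x, \tau^x, \mathbf{b}^x) = P(\mathbf{M}^x \mid \mathbf{b}^x, \mathbf{F}^x) \prod_{j=1}^{M^x} \frac{1}{\sqrt{2\pi(\tau^x)^2}} \exp\left( -\frac{(\log(S_j^x) - \alpha^x)^2}{2(\tau^x)^2} \right)$

where $P(\mathbf{M}^x \mid \mathbf{b}^x, \mathbf{F}^x)$, with $\mathbf{F}^x$ the follower counts $f_j^x$, is given by the binomial of equation (9).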
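
The training-tweet likelihood is a product of normal densities for the log reaction times and binomial probabilities for the vertex degrees, so its logarithm is a simple sum. The following is a minimal sketch of that sum; the function names and argument layout are assumptions, and each retweet probability must lie strictly between 0 and 1.

```python
import math
from typing import Sequence


def log_choose(n: int, k: int) -> float:
    """log of the binomial coefficient C(n, k), via log-gamma for stability."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def training_log_likelihood(reaction_times: Sequence[float],
                            degrees: Sequence[int],
                            followers: Sequence[int],
                            probs: Sequence[float],
                            alpha: float, tau: float) -> float:
    """Log-likelihood of a fully observed (training) retweet graph."""
    ll = 0.0
    # Normal density of each log reaction time, one term per retweet.
    for s in reaction_times:
        z = (math.log(s) - alpha) / tau
        ll += -0.5 * math.log(2.0 * math.pi * tau ** 2) - 0.5 * z * z
    # Binomial probability of each vertex's final out-degree M_j given (f_j, b_j).
    for m, f, b in zip(degrees, followers, probs):
        ll += log_choose(f, m) + m * math.log(b) + (f - m) * math.log1p(-b)
    return ll


# Made-up graph: a root with 1,000 followers and two retweeters who are
# not retweeted themselves.
print(training_log_likelihood(
    reaction_times=[120.0, 900.0],
    degrees=[2, 0, 0], followers=[1000, 50, 80], probs=[0.002, 0.01, 0.01],
    alpha=7.4, tau=2.3,
))
```

Working on the log scale avoids underflow when a graph has hundreds of retweets.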

For the prediction tweets, we do not observe the $M_j^x$'s and so will need to
marginalize over them. First, we derive the conditional distribution of the
observations $\mathbf{S}_t^x = \bigcup_{j=1}^{m^x(t)} S_j^x$ and $\mathbf{m}_t^x = \bigcup_{j=0}^{m^x(t)} m_j^x(t)$ conditional on $\mathbf{M}_t^x = \bigcup_{j=0}^{m^x(t)} M_j^x$,
$\alpha^x$, and $\tau^x$. With this conditioning, the contribution to the probability from each
vertex $v_j^x$ observed by time $t$ has three components:

1. The log-normal likelihood of its observed reaction time (equation (2)).

2. The unobserved retweets of its children in the retweet graph. That is, for each
vertex $v_j^x$ that retweets at time $T_j^x < t$, we have $m_j^x(t)$ observed retweets by
time $t$ and $M_j^x - m_j^x(t)$ unobserved retweets. Because we are making the
observations at time $t$, these $M_j^x - m_j^x(t)$ reaction times must be greater
than $t - T_j^x$. Therefore, if we define the cumulative distribution function of
$\mathcal{N}(\alpha^x, (\tau^x)^2)$ as $F(\cdot \mid \alpha^x, \tau^x)$, the contribution to the conditional distribution
is $\left(1 - F(\log(t - T_j^x) \mid \alpha^x, \tau^x)\right)^{M_j^x - m_j^x(t)}$. That is, $M_j^x - m_j^x(t)$ potential
retweeters of $v_j^x$ have not done so yet (or we would have observed it by
time $t$).

3. A combinatorial term $\binom{M_j^x}{m_j^x(t)}$ which must be included because the
unobserved retweets from the children of $v_j^x$ could be any $M_j^x - m_j^x(t)$ of its $M_j^x$
children.

$^1$ Except in the degenerate case where $m_j^x = f_j^x$, in which case $M_j^x = m_j^x$.

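Putting these components together, the distribution of the prediction tweet
observations, conditional on the missing $M_j^x$, is given by

(17) $P(\mathbf{S}_t^x, \mathbf{m}_t^x \mid \alpha^x, \tau^x, \mathbf{M}_t^x) = \prod_{j=1}^{m^x(t)} \frac{1}{\sqrt{2\pi(\tau^x)^2}} \exp\left( -\frac{(\log(S_j^x) - \alpha^x)^2}{2(\tau^x)^2} \right) \prod_{j=0}^{m^x(t)} \binom{M_j^x}{m_j^x(t)} \left(1 - F(\log(t - T_j^x) \mid \alpha^x, \tau^x)\right)^{M_j^x - m_j^x(t)}$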



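To obtain the complete data likelihood, we simply multiply equation (17) by $P(M_j^x \mid b_j^x, f_j^x)$
and sum over all possible values of $M_j^x$. If we define $\mathbf{b}_t^x = \bigcup_{j=0}^{m^x(t)} b_j^x$, then the
marginal likelihood is

(18) $P(\mathbf{S}_t^x, \mathbf{m}_t^x \mid \alpha^x, \tau^x, \mathbf{b}_t^x) = \prod_{j=1}^{m^x(t)} \frac{1}{\sqrt{2\pi(\tau^x)^2}} \exp\left( -\frac{(\log(S_j^x) - \alpha^x)^2}{2(\tau^x)^2} \right) \prod_{j=0}^{m^x(t)} \sum_{M_j^x = m_j^x(t)}^{f_j^x} \binom{M_j^x}{m_j^x(t)} \left(1 - F(\log(t - T_j^x) \mid \alpha^x, \tau^x)\right)^{M_j^x - m_j^x(t)} P(M_j^x \mid b_j^x, f_j^x)$

Since this equation does not yield a closed form, we rely on imputing the missing
$M_j^x$ as described next in Section 3.4.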
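
For a single observed vertex, the sum over its unobserved final degree can be evaluated numerically. The sketch below does this on the log scale; it assumes the sum runs from the observed degree up to the follower count and that the retweet probability is strictly between 0 and 1. The function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def survival(t: float, t_j: float, alpha: float, tau: float) -> float:
    """1 - F(log(t - T_j) | alpha, tau): chance a reaction time exceeds t - T_j."""
    return 1.0 - normal_cdf((math.log(t - t_j) - alpha) / tau)


def log_choose(n: int, k: int) -> float:
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def vertex_marginal(m_obs: int, f: int, b: float, surv: float) -> float:
    """Sum over M = m_obs..f of C(M, m_obs) * surv^(M - m_obs) * Bi(M; f, b)."""
    log_terms = []
    for big_m in range(m_obs, f + 1):
        extra = big_m - m_obs            # retweets not yet observed
        if extra > 0 and surv <= 0.0:
            break                        # nothing can still be pending
        log_terms.append(
            log_choose(big_m, m_obs)
            + (extra * math.log(surv) if extra > 0 else 0.0)
            + log_choose(f, big_m)
            + big_m * math.log(b)
            + (f - big_m) * math.log1p(-b)
        )
    peak = max(log_terms)                # log-sum-exp for numerical stability
    return math.exp(peak) * sum(math.exp(v - peak) for v in log_terms)


# Vertex that retweeted at T_j = 60 s, observed at t = 600 s with 3 retweets
# among 200 followers (all numbers made up for illustration).
s = survival(600.0, 60.0, alpha=7.4, tau=2.3)
print(s, vertex_marginal(3, 200, 0.02, s))
```

The cost is linear in the follower count per vertex, which is one practical reason to impute the missing degrees inside a sampler rather than evaluate this sum repeatedly.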

3.4. Posterior Computations. To summarize, our goal is to calculate a
predictive distribution for reaction times, and hence $M^x$ (the number of eventual
retweets of a root tweet $x$), given a set of observed (training) retweet paths and
the partial history of the prediction tweet $x$ observed up to time $t^x$. Recall that
our model consists of three types of parameters. First, there are the global
parameters $\Phi = \{\alpha, \sigma_\alpha, a_\tau, b_\tau, \beta_0, \beta_f, \beta_d, \sigma_b\}$ which are shared between tweets and
govern the heterogeneous distributions. Second, there are tweet specific parameters
$\boldsymbol{\alpha} = \bigcup_x \alpha^x$ and $\boldsymbol{\tau} = \bigcup_x \tau^x$. Third, there is a tweet and user specific parameter: the
retweet probability $b_j^x$. We define the set of all retweet probabilities as $\mathbf{b} = \bigcup_{x,j} b_j^x$.

The final vertex degrees ($M_j^x$) are missing data for the prediction tweets. We
define $\mathcal{P}$ as the set of prediction tweets and $\mathcal{T}$ as the set of training tweets. We
define the set of unobserved $M_j^x$ for the prediction tweets as $\mathbf{M}_\mathcal{P} = \bigcup_{x \in \mathcal{P}, j} M_j^x$
and the set of observed $M_j^x$ for the training tweets as $\mathbf{M}_\mathcal{T} = \bigcup_{x \in \mathcal{T}, j} M_j^x$. We
define the set of observed reaction times for both the training and prediction tweets
as $\mathbf{S} = \bigcup_{x,j} S_j^x$. Using the conditional dependencies in our model as laid out in
Figure 7, the posterior distribution of the model parameters and $\mathbf{M}_\mathcal{P}$ given $\mathbf{S}$ and
$\mathbf{M}_\mathcal{T}$ can be written as

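(19) $P(\Phi, \boldsymbol{\alpha}, \boldsymbol{\tau}, \mathbf{b}, \mathbf{M}_\mathcal{P} \mid \mathbf{S}, \mathbf{M}_\mathcal{T}) \propto P(\Phi) \prod_x P(\alpha^x \mid \alpha, \sigma_\alpha) \, P(\tau^x \mid a_\tau, b_\tau) \prod_{x \in \mathcal{T}} P(\mathbf{S}^x \mid \alpha^x, \tau^x, \mathbf{M}^x) \prod_{x \in \mathcal{P}} P(\mathbf{S}^x_{t^x}, \mathbf{m}^x_{t^x} \mid \alpha^x, \tau^x, \mathbf{M}^x_{t^x})$.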

To examine our desired predictive distribution of $\mathbf{M}_\mathcal{P}$, we sample from
equation (19) using an MCMC sampler which involves sampling the model
parameters in addition to $\mathbf{M}_\mathcal{P}$. The predictive distribution is approximated by considering
the samples of $\mathbf{M}_\mathcal{P}$. The details of the stages of our sampler are provided in the
Appendix.
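
One way to picture the imputation of a missing final degree is through its conditional distribution given the other quantities. Combining the binomial prior on $M_j^x$ with the censoring and combinatorial terms of the prediction-tweet likelihood, a short calculation suggests that the number of still-pending retweets $M_j^x - m_j^x(t)$ is binomial with $f_j^x - m_j^x(t)$ trials and success probability $q = b s / (b s + 1 - b)$, where $s = 1 - F(\log(t - T_j^x) \mid \alpha^x, \tau^x)$. The sketch below draws from that distribution. It is offered as an illustration derived here, not as the sampler stage described in the Appendix, and the function name is hypothetical.

```python
import random


def impute_final_degree(m_obs: int, f: int, b: float, surv: float,
                        rng: random.Random) -> int:
    """Draw M_j given m_j(t), f_j, b_j and the censoring probability surv.

    Assumes 0 < b < 1 and 0 <= surv <= 1. Each follower who has not yet
    retweeted is treated as a pending retweeter with probability q.
    """
    q = b * surv / (b * surv + (1.0 - b))
    pending = sum(1 for _ in range(f - m_obs) if rng.random() < q)
    return m_obs + pending


rng = random.Random(42)
# Vertex with 3 observed retweets among 200 followers; b and surv are made up.
draws = [impute_final_degree(3, 200, 0.02, 0.6, rng) for _ in range(5)]
print(draws)
```

Summing such draws over all observed vertices of a prediction tweet, within each iteration of a sampler, yields samples of that tweet's eventual retweet count.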

4. Results. We partition our dataset into a set of 26 training tweets $\mathcal{T}$ and a
set of 26 prediction tweets $\mathcal{P}$. We randomly divide the tweets such that the training
and prediction sets have similar retweet count distributions. We aim to calculate
the predictive distribution for $\mathbf{M}_\mathcal{P}$ using a fixed observation fraction of retweets for
each prediction tweet. For instance, for an observation fraction of 10%, we used
as observations all data from the 26 training tweets, and the first 10% of the total
number of reaction times for each of the 26 prediction tweets. Note that by fixing
the observation fraction, we are observing each prediction tweet up to a different
time. We use observation fractions ranging from 10% to 100%. An observation
fraction of 100% represents a fully in-sample analysis, and lower fractions are used
to understand how early on in a tweet's life predictions can be made.

For each observation fraction, we generated posterior samples using three inde- 
pendent MCMC chains with dispersed starting points run for 3,000 iterations and 
discarding a burn-in period of 1,000 iterations. Convergence of the MCMC sam- 
pler was assessed using the Gelman-Rubin statistic (Gelman and Rubin, 1992). A 
histogram of the posterior samples of the global parameters for an observation
fraction of 100% is shown in Figure 8 and the corresponding posterior means are shown
in Table 1. We find that the posterior mean of $\alpha$ is 7.42, which is comparable to the
mean of the ML estimates of $\alpha^x$ from Section 2.3 (7.31). Also, the 90% posterior
credible intervals of the $\beta$ parameters do not contain 0, indicating that these
parameters are important to the predictive power of our model and agree with our earlier
analyses from Section 2.4.
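
The convergence check mentioned above can be sketched as follows. This is the basic form of the Gelman-Rubin potential scale reduction factor for a single scalar parameter, without the degrees-of-freedom correction; the function name and the toy chains are assumptions, and the authors' exact implementation is not specified in the text.

```python
import math
from statistics import mean, variance
from typing import List, Sequence


def gelman_rubin(chains: Sequence[Sequence[float]]) -> float:
    """Potential scale reduction factor for equal-length chains (post burn-in)."""
    n = len(chains[0])                               # draws per chain
    chain_means: List[float] = [mean(c) for c in chains]
    w = mean([variance(c) for c in chains])          # within-chain variance
    b = n * variance(chain_means)                    # between-chain variance
    var_plus = (n - 1) / n * w + b / n               # pooled variance estimate
    return math.sqrt(var_plus / w)


# Toy chains (made up): the first pair overlaps, the second pair does not.
mixed = [[7.3, 7.5, 7.4, 7.6, 7.2], [7.4, 7.3, 7.6, 7.5, 7.3]]
stuck = [[7.3, 7.5, 7.4, 7.6, 7.2], [9.3, 9.5, 9.4, 9.6, 9.2]]
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values close to 1 suggest that the chains are sampling the same distribution, while values well above 1 indicate that more iterations or better starting points are needed.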
FIG 8. Histograms of posterior samples of global parameters with an observation fraction of 100%.

In Section 4.1 we describe our prediction results for the number of eventual
retweets, followed by an analysis in Section 4.2 that looks at the impact of the
number of followers ($f_j^x$) and the depth of the retweeters ($d_j^x$) on our predictions.



Parameter            Posterior Mean (s.d.)
$\alpha$             7.42 (0.10)
$\sigma_\alpha$      0.65 (0.07)
$a_\tau$             0.45 (0.07)
$b_\tau$             2.11 (0.55)
$\sigma_b$           1.69 (0.18)
$\beta_0$            -4.61 (0.85)
$\beta_f$            -0.28 (0.06)
$\beta_d$            -8.22 (0.59)

Table 1. Posterior means and standard deviations (s.d.) for the global model parameters with an
observation fraction of 100% (a fully in-sample analysis).



4.1. Retweet Prediction Results. The predictions of our model for the total
number of retweets come from the $M_j^x$ of the observed (re)tweeters. For instance,
if at time $t^x$ we observe $m^x(t^x)$ retweets, our prediction of the total number of
retweets is given by the predictive distribution of $\sum_{j=0}^{m^x(t^x)} M_j^x$. This serves as a
step-ahead forecast of $M^x$. We discuss possibilities to go beyond this step-ahead
prediction in Section 5.1.

Our predictions are for observation fractions ranging from 10% to 100%. The
prediction results for four different root tweets are shown in Figure 9. We plot the
median and 90% posterior credible intervals for the total number of retweets for 
different observation fractions. The predictions are plotted along with the number 
of observed retweets versus time. From these plots, it can be seen qualitatively that 
the predictions made within a few minutes for the eventual number of retweets are 
relatively close to the true value. 

To better understand the model predictions at the individual tweet level, we show 
boxplots of the posterior distribution of the percent error for each prediction tweet 
for different observation fractions in Figure 10. The whiskers on the boxplots are 
the 90% posterior credible intervals. As can be seen, as we increase the observation 
fraction, the prediction error tends to decrease. There are a few tweets which have 
exceptionally large errors at a 40% observation fraction. We discuss these tweets 
in Section 5.2. 

We can aggregate these results across all prediction tweets by looking at the
absolute percent error (APE) of predictions made using the posterior median as our
prediction value. In Figure 11 we show a boxplot of the APE for all 26 prediction
tweets versus observation fraction. As can be seen, for our model the median APE
(MAPE) is below 40% for observation fractions ranging from 10% to 100%. The
average retweet time of the prediction tweets at a 10% observation fraction is 4.4
minutes. Therefore, using only a few minutes of observations, we can predict the
total number of retweets with reasonable accuracy. To get a sense of how good the
predictions are, consider the MAPE at 10% and 90%. At 10%, if one thought that
there were no more retweets, the error would be 90%. Our model's median error is
less than 40%, which means that the model predicts that the tweet will receive many
more retweets. At 90%, if one thought that there were no more retweets, the error
would be 10%. Our model's median error is less than 10%, which means that the
model predicts that the tweet is almost done spreading. Therefore, we see that our
model can predict if a tweet has a significant amount of (retweet) life left or if it
is near its end.
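
The arithmetic behind this comparison can be made explicit with a small Python sketch. The function names are illustrative only. The first function computes the APE of a single prediction, and the second gives the APE of the baseline that assumes no further retweets will arrive, which equals one minus the observation fraction.

```python
def absolute_percent_error(predicted, actual):
    """Absolute percent error (APE) of a prediction, in percent."""
    if actual <= 0:
        raise ValueError("actual must be positive")
    return 100.0 * abs(predicted - actual) / actual

def no_more_retweets_ape(observed_fraction):
    """APE of the baseline that predicts the currently observed count
    as the final total; observed_fraction lies in (0, 1]."""
    if not 0.0 < observed_fraction <= 1.0:
        raise ValueError("observed_fraction must be in (0, 1]")
    return 100.0 * (1.0 - observed_fraction)
```

For example, a tweet with 100 eventual retweets observed at a 10% fraction has 10 observed retweets, so the baseline error is 90%, which is the reference point against which the model's MAPE is judged.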

4.2. Impact of $f_j^x$ and $d_j^x$. To show the importance of $f_j^x$ and $d_j^x$ to our retweet
model, we compare to a strawman model which ignores these covariates. The
strawman model assumes that $M_j^x$ comes from a Poisson distribution (not binomial
as before, since $f_j^x$ is unknown) with global rate $\lambda$, but keeps the reaction time
component of the retweet model the same. We put an uninformative gamma prior
on $\lambda$ with shape and scale parameters 1 and 500, respectively.

We use the median of the predictive distribution as a point estimate of the
number of retweets in comparing our model's performance to that of the strawman.
In Figure 11 we show boxplots of the absolute percent error (APE) of the two
models' predictions for all of the prediction tweets versus the observation
fraction. For an observation fraction of 10% (where predictions are most useful) the
error of the strawman model is very high (MAPE = 80%) compared to our model
(MAPE = 29%). Also, while our model's error tends to decrease as more retweets
are observed, the strawman model's error decreases to a point and then increases
again. The strawman model's prediction for the total number of retweets is
essentially a constant multiplied by the number of observed retweets. To make this more
evident, in Figure 12 we plot the MAPE versus observation fraction for both
models and a naive model which predicts $1.4\,m^x(t^x)$ for the eventual number of
retweets. As can be seen, the error of the strawman is very similar to that of the
naive model.

FIG 9. Prediction of the total number of retweets for four different root tweets. The solid line
represents the number of observed retweets versus time. The solid square is the posterior median of the
predictive distribution for the total number of retweets based on observations only up to that time
point. The error bars correspond to the 90% credible intervals. The horizontal dashed line is the
final number of observed retweets $M^x$. The root user and total number of retweets of each tweet are
shown in the plots.

FIG 10. Boxplots of prediction absolute percent error (APE) for 26 prediction tweets. Each plot
corresponds to a different observation fraction of retweets.

[Figure 11: two boxplot panels, "Retweet Model" and "Strawman Model"; x-axis: Observed Fraction [%]; y-axis: APE [%].]

FIG 11. Boxplots of the percent error of the retweet model and the strawman model at different observation
fractions.

To assess the overall fit of the two models, we compare their average log- 
likelihood (LL) and deviance information criterion (DIC) (Spiegelhalter et al., 
2002) for an observation fraction of 100% in Table 2. Models which fit better have 
larger values for the LL and smaller values for the DIC. As can be seen from Table 
2, our model has a significantly better fit than the strawman model. This analysis
demonstrates that $f_j^x$ (user information) and $d_j^x$ (retweet graph structure) are
important elements for predicting retweets accurately.

         Retweet Model    Strawman Model
LL       -38,860          -103,907
DIC      83,848           208,026

Table 2
Average log-likelihood (LL) and deviance information criterion (DIC) for a 100% observation
fraction for the full retweet model and the strawman model.
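
For readers unfamiliar with the DIC, the following Python sketch shows the standard definition from Spiegelhalter et al. (2002): the posterior mean deviance plus an effective number of parameters. It assumes that the log-likelihood has been evaluated at each posterior sample and at the posterior mean of the parameters. The function and argument names are hypothetical, and this is not the code used to produce Table 2.

```python
import numpy as np

def dic(loglik_samples, loglik_at_posterior_mean):
    """Deviance information criterion from MCMC output.

    loglik_samples: log-likelihood evaluated at each posterior sample.
    loglik_at_posterior_mean: log-likelihood at the posterior mean parameters.
    Returns (DIC, p_D), where p_D is the effective number of parameters.
    """
    deviances = -2.0 * np.asarray(loglik_samples, dtype=float)
    mean_deviance = deviances.mean()                 # D-bar
    deviance_at_mean = -2.0 * loglik_at_posterior_mean
    p_d = mean_deviance - deviance_at_mean           # effective parameters
    return mean_deviance + p_d, p_d
```

Smaller DIC values indicate a better trade-off between fit and complexity, which is the sense in which the two models are compared.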



5. Model Extension Opportunities. We next discuss various extensions to
our retweet model. We first discuss improving our prediction using future potential
retweeters. Then we discuss evidence in our data which suggests possible
extensions to our reaction time model. Finally, we discuss the incorporation of side
information for the tweets.

FIG 12. Plot of the median absolute percentage error (MAPE) versus observation fraction of retweets
for 26 root tweets. The three curves are the MAPE for the retweet model, a strawman model which
ignores $f_j^x$ and $d_j^x$, and a naive model which always predicts $1.4\,m^x(t^x)$.

5.1. Distribution Over Future Potential Retweeters. Our current prediction is 
based on eventual retweets from existing users in the observed retweet graphs and 
does not take into account retweets of future retweeters who have not yet been 
observed. We can think of this prediction as a step-ahead forecast of the total even- 
tual number of retweeters. In practice, it quickly provides a good estimate since 
most retweet graphs have low depth. However, one could extend our prediction 
to account for the eventual retweets from users who have not yet been observed, 
in particular, by integrating over our uncertainty. This type of prediction would 
require greater knowledge of the structure of the underlying follower graph. For 
instance, if a user has a follower with a large number of followers, this user may 
receive a large number of retweets due to a retweet from this follower. Therefore, 
incorporation of unobserved retweeters could potentially improve our predictions, 
but would require obtaining more data on the follower graph. Note, however, that 
under the (experimentally validated) assumption that the probability of retweeting 
decreases with depth, the sensitivity of our predictions to inaccuracies of future 
retweeter information is likely minimal. 

5.2. Reaction Time Modeling. At an observation fraction of 40% there are four
different tweets with very large errors compared to the other tweets. We looked at
these tweets more closely to try to understand the source of this error. The number
of retweets for these tweets ranged from 73 to 608. What these tweets had in
common was that the number of retweets increased very rapidly at first
and then slowed down considerably. This behavior deviates from the log-normal
reaction time model.

FIG 13. Plot of median reaction time versus $\Delta^x$ for the prediction tweets. The triangle points are the
tweets with large prediction errors at 40% observation from Figure 10.

If the reaction times were log-normal, then their logarithms would be normally
distributed and the difference between the median and mean of their logarithms
would be zero. Any deviation of this difference from zero can be viewed as a
deviation from log-normality. We define $\Delta^x$ as this difference normalized by the
median of the logarithm of the reaction times:

\[
\Delta^x = \frac{\operatorname{mean}(\log(S_j^x)) - \operatorname{median}(\log(S_j^x))}{\operatorname{median}(\log(S_j^x))}.
\]
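
A minimal Python sketch of this diagnostic is given below. It assumes the reaction times of a single tweet are available as positive numbers in a common time unit; the function name is illustrative. Note that the statistic is undefined when the median log reaction time is exactly zero.

```python
import numpy as np

def lognormality_deviation(reaction_times):
    """Normalized gap between mean and median of the log reaction times.

    Values near zero are consistent with log-normal reaction times;
    large values indicate a skewed distribution of log reaction times.
    """
    logs = np.log(np.asarray(reaction_times, dtype=float))
    med = np.median(logs)
    if med == 0.0:
        raise ValueError("median of log reaction times is zero")
    return (logs.mean() - med) / med
```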



To show the similarities of the four high error tweets, in Figure 13 we plot $\Delta^x$
versus the median reaction time for each prediction tweet. The four triangles in
the plot are the tweets with the large errors. As can be seen, these tweets have a
short median reaction time along with a large value for $\Delta^x$. Therefore, it seems
that these tweets have reaction times that are not well modeled by the log-normal
distribution, which leads to the larger prediction errors. It is an interesting area of
future research to try to understand what properties of these tweets and the users
who posted them cause this type of retweeting behavior and why the reaction times
are not well modeled by the log-normal distribution.

5.3. Incorporation of Side Information. Our model relied primarily on the timing
information of retweets, depth in the retweet graph, and number of followers
for predictions. However, there are other types of side information that we could
incorporate which may potentially improve the accuracy of the predictions. One
type of side information is the time of day. It may be that the retweet behavior of
a tweet depends upon the time it was posted. Another type of side information is
the content of the tweet. For instance, retweet behavior may depend upon the topic
of the tweet, and whether or not that topic is a currently trending topic in Twitter.
These types of side information can be readily incorporated into our modeling
framework as covariates for the parameters such as $\alpha^x$ and $b_j^x$.

6. Conclusion. We have presented a model for retweet dynamics in Twitter. 
Our Bayesian approach allowed us to provide predictions for the total number of 
retweets, along with posterior credible intervals for the predictions. The predictions 
had a MAPE of less than 40% when at least 10% of the total number of retweets 
were observed. For most tweets, this translated to an average error less than 40% 
within 5 minutes of the tweet being posted. 

We have shown that given the size of the retweeter network and depth from the 
source tweet, we are able to predict the number of potential viewers received by 
a tweet. The level of accuracy in our predictions allows us to consider using this 
model for different applications. For example, it can be used to turn tweets into a 
potential source of impressions for display ads. Because tweets are typically only 
actively retweeted for a few hours, the early predictions our model provides are key 
to detecting a popular tweet before it receives a large number of retweets. Also, the
similarity of the manner by which people spread content in social networks suggests
that this model can be used for other social networks such as Facebook. Therefore, 
our model's early predictions could create a whole new source of impressions for 
online advertising on dynamic social network content with a finite "lifetime". 

Finally, because this model is for a single tweet, it can be used as the foundation 
for a more general model for the spread of broader ideas which involve multiple 
tweets from multiple users. Our model can easily be parallelized via techniques 
such as MapReduce to analyze very large collections of tweets. With a model for 
the spread of ideas, we could develop a better understanding of how memes and 
trends spread and potentially predict the speed and magnitude of their popularity. 

APPENDIX A: DETAILS OF MCMC SAMPLER 

We use a Metropolis-within-Gibbs scheme to sample from the posterior distribution
of the model parameters. We define the set of model parameters as $\Theta =
\{\Phi, b, \alpha^x, M_P\}$ and for any parameter $\gamma \in \Theta$, we define the set of parameters
excluding $\gamma$ as $\Theta_{-\gamma}$. We also define the set of observed reaction times as $S$. For our
MCMC sampler, we must sample from the conditional distribution $\gamma \mid S, M_T, \Theta_{-\gamma}$
for each model parameter. We will now derive these conditional distributions and
show how to sample from them.



A.1. Retweet Graph Structure Parameters.

Hyperparameters $\beta_0, \beta_F, \beta_d, \sigma_b^2$. The prior distributions for $\beta_0$, $\beta_F$, and $\beta_d$ are
normal with mean 0 and standard deviation $\sigma_\beta = 100$. It can be shown that the
joint conditional distribution of $(\beta_0, \beta_F, \beta_d)$ is multivariate normal with mean $\mu$
and covariance matrix $C$. Because of this, we can directly sample the $\beta$'s in a
Gibbs step. We simply need to determine $\mu$ and $C$. To do this, first we let $N$ be
the total number of observed reaction times for all training and prediction tweets.
To express the mean and covariance of the conditional distribution, it is helpful to
define the following variables.

\[
\begin{aligned}
N_1 &= N + \sigma_b^2 \sigma_\beta^{-2}, & Y &= \sum_{x,j} \operatorname{logit}(b_j^x), \\
D &= \sum_{x,j} \log(d_j^x + 1), & Y_d &= \sum_{x,j} \operatorname{logit}(b_j^x) \log(d_j^x + 1), \\
F &= \sum_{x,j} \log(f_j^x + 1), & Y_F &= \sum_{x,j} \operatorname{logit}(b_j^x) \log(f_j^x + 1), \\
D_2 &= \sum_{x,j} \log^2(d_j^x + 1) + \sigma_b^2 \sigma_\beta^{-2}, & E &= \sum_{x,j} \log(f_j^x + 1) \log(d_j^x + 1), \\
F_2 &= \sum_{x,j} \log^2(f_j^x + 1) + \sigma_b^2 \sigma_\beta^{-2}.
\end{aligned}
\]

Then the covariance matrix of the conditional distribution is given by

\[
C = \sigma_b^2
\begin{pmatrix}
N_1 & F & D \\
F & F_2 & E \\
D & E & D_2
\end{pmatrix}^{-1}
\]

and its mean is given by

\[
\mu =
\begin{pmatrix}
N_1 & F & D \\
F & F_2 & E \\
D & E & D_2
\end{pmatrix}^{-1}
\begin{pmatrix}
Y \\ Y_F \\ Y_d
\end{pmatrix}.
\]
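
The Gibbs step for the regression coefficients can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the paper: it assumes the linear model $\operatorname{logit}(b_j^x) = \beta_0 + \beta_F \log(f_j^x + 1) + \beta_d \log(d_j^x + 1)$ with normal noise of standard deviation $\sigma_b$, and independent zero-mean normal priors with standard deviation 100 on the coefficients. The function name and argument names are hypothetical.

```python
import numpy as np

def sample_beta(logit_b, f, d, sigma_b, sigma_beta=100.0, rng=None):
    """Draw (beta_0, beta_F, beta_d) from its normal conditional distribution.

    logit_b: current values of logit(b_j^x), one per observed user.
    f, d: follower counts and depths for the same users.
    Returns (draw, mean, cov).
    """
    rng = np.random.default_rng() if rng is None else rng
    logit_b = np.asarray(logit_b, dtype=float)
    f = np.asarray(f, dtype=float)
    d = np.asarray(d, dtype=float)
    # Design matrix: intercept, log(f + 1), log(d + 1).
    X = np.column_stack([np.ones(f.size), np.log(f + 1.0), np.log(d + 1.0)])
    # Cross-product matrix with the prior precision added to the diagonal.
    A = X.T @ X + (sigma_b ** 2 / sigma_beta ** 2) * np.eye(3)
    mean = np.linalg.solve(A, X.T @ logit_b)
    cov = sigma_b ** 2 * np.linalg.inv(A)
    cov = 0.5 * (cov + cov.T)  # remove tiny numerical asymmetry
    return rng.multivariate_normal(mean, cov), mean, cov
```

The entries of `A` correspond to the sums of covariates and their products over all observed users, and `X.T @ logit_b` collects the matching sums involving $\operatorname{logit}(b_j^x)$.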



The prior distribution of $\sigma_b^2$ is inverse-gamma with shape and scale parameters
$a_{\sigma_b} = 0.5$ and $b_{\sigma_b} = 0.5$, respectively. We can directly sample from the conditional
distribution for $\sigma_b^2$ because it is inverse-gamma with shape parameter $a'_{\sigma_b}$ and scale
parameter $b'_{\sigma_b}$ given by

\[
a'_{\sigma_b} = a_{\sigma_b} + \frac{N}{2}, \qquad
b'_{\sigma_b} = b_{\sigma_b} + \frac{1}{2} \sum_{x,j} \left( \operatorname{logit}(b_j^x) - \mu_j^x \right)^2,
\]

where $\mu_j^x = \beta_0 + \beta_F \log(f_j^x + 1) + \beta_d \log(d_j^x + 1)$.
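
A conjugate variance update of this kind can be sketched in Python as below. The sketch assumes the usual normal and inverse-gamma conjugacy, in which the shape grows by half the number of residuals and the scale grows by half their sum of squares; the residuals would be $\operatorname{logit}(b_j^x) - \mu_j^x$ here. The function name is hypothetical. An inverse-gamma draw with a given shape and scale is obtained by dividing the scale by a unit-scale gamma draw with the same shape.

```python
import numpy as np

def sample_variance_inverse_gamma(residuals, prior_shape, prior_scale, rng):
    """Draw a variance from its inverse-gamma conditional distribution.

    residuals: deviations of the normal observations from their means.
    prior_shape, prior_scale: inverse-gamma prior parameters (0.5 and 0.5
    in the text).
    """
    residuals = np.asarray(residuals, dtype=float)
    shape = prior_shape + residuals.size / 2.0
    scale = prior_scale + 0.5 * np.sum(residuals ** 2)
    # If G ~ Gamma(shape, 1), then scale / G ~ Inverse-Gamma(shape, scale).
    return scale / rng.gamma(shape)
```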
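Parameters $b_j^x$. The conditional distribution of $b_j^x$ is given by

\[
\begin{aligned}
P(b_j^x \mid S, M_T, \Theta_{-b_j^x}) &\propto P(M_j^x \mid b_j^x)\, P(b_j^x \mid \beta_0, \beta_F, \beta_d, \sigma_b) \\
&\propto (b_j^x)^{M_j^x - 1} (1 - b_j^x)^{f_j^x - M_j^x - 1}
\exp\left( -\frac{\left( \operatorname{logit}(b_j^x) - \mu_j^x \right)^2}{2 \sigma_b^2} \right).
\end{aligned}
\]

To sample from this conditional distribution, we use a Metropolis-Hastings step
with the proposal value for $\operatorname{logit}(b_j^x)$ drawn from a normal distribution with mean
$\mu_j^x$ and standard deviation $\sigma_b$.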
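
Because the proposal is the prior itself, the Metropolis-Hastings acceptance ratio reduces to a ratio of binomial likelihoods. The following Python sketch illustrates one such step under that assumption, with $M_j^x$ treated as a binomial count out of $f_j^x$ followers. The function names are hypothetical and the sketch is not the authors' code.

```python
import math
import numpy as np

def log_binom_lik(b, M, f):
    """Log binomial likelihood of M retweets out of f followers (up to a constant)."""
    return M * math.log(b) + (f - M) * math.log1p(-b)

def mh_step_b(b_current, M, f, mu, sigma_b, rng):
    """One independence Metropolis-Hastings step for a retweet probability.

    The proposal for logit(b) is drawn from the normal prior N(mu, sigma_b^2),
    so the prior cancels and only the likelihood ratio remains.
    Returns (new_value, accepted).
    """
    z = rng.normal(mu, sigma_b)
    b_prop = 1.0 / (1.0 + math.exp(-z))
    b_prop = min(max(b_prop, 1e-12), 1.0 - 1e-12)  # keep logs finite
    log_ratio = log_binom_lik(b_prop, M, f) - log_binom_lik(b_current, M, f)
    accept = log_ratio >= 0 or rng.uniform() < math.exp(log_ratio)
    return (b_prop, True) if accept else (b_current, False)
```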

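Missing $M_j^x$. The conditional distribution for $M_j^x$ is

\[
P(M_j^x \mid S, M_T, \Theta_{-M_j^x}) \propto
\binom{M_j^x}{m_j^x} \left( 1 - F(\log(t - S_j^x) \mid \alpha^x, \tau^x) \right)^{M_j^x - m_j^x}
\binom{f_j^x}{M_j^x} (b_j^x)^{M_j^x} (1 - b_j^x)^{f_j^x - M_j^x}\,
\mathbf{1}\{ M_j^x \geq m_j^x \}.
\]

We generate samples from this conditional distribution using a Metropolis-Hastings
step with the proposal for $M_j^x$ drawn from a binomial distribution $\mathrm{Bi}(f_j^x, b_j^x)$.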
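
One way to carry out this step is sketched below in Python. It assumes the target is the binomial distribution used for the proposal multiplied by a censoring term: a binomial coefficient for choosing the $m_j^x$ observed retweeters among the $M_j^x$ eventual ones, times the probability that the remaining ones have not yet reacted. Under that assumption the binomial part cancels in the acceptance ratio. The names `surv` (the probability of not having reacted by the observation time) and `mh_step_M` are hypothetical.

```python
import math
import numpy as np

def log_choose(n, k):
    """Log of the binomial coefficient C(n, k) for integers 0 <= k <= n."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_censoring_term(M, m, surv):
    """Log of C(M, m) * surv**(M - m), the factor not cancelled by the proposal."""
    return log_choose(M, m) + (M - m) * math.log(surv)

def mh_step_M(M_current, m, f, b, surv, rng):
    """One Metropolis-Hastings step for the eventual retweet count of a user.

    m: retweets observed so far, f: number of followers, b: retweet
    probability, surv: probability in (0, 1] that an eventual retweeter
    has not reacted yet. Returns (new_value, accepted).
    """
    M_prop = int(rng.binomial(f, b))
    if M_prop < m:  # indicator: the total cannot be below the observed count
        return M_current, False
    log_ratio = (log_censoring_term(M_prop, m, surv)
                 - log_censoring_term(M_current, m, surv))
    accept = log_ratio >= 0 or rng.uniform() < math.exp(log_ratio)
    return (M_prop, True) if accept else (M_current, False)
```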

A.2. Retweet Time Parameters. 

Hyperparameters $\alpha, \sigma_\alpha^2, a_\tau, b_\tau$. We utilized an extremely diffuse prior distribution
for $\alpha$ that is normal with mean 0 and standard deviation 100. The
conditional distribution of $\alpha$ is again normal with mean $\mu'_\alpha$ and variance $\sigma'^2_\alpha$, so it
can be directly sampled. If we define the total number of root tweets (training and
prediction) as $N_t$, then the mean and variance are

\[
\mu'_\alpha = \sigma'^2_\alpha\, \sigma_\alpha^{-2} \sum_x \alpha^x, \qquad
\sigma'^2_\alpha = \left( N_t\, \sigma_\alpha^{-2} + 100^{-2} \right)^{-1}.
\]

The prior distribution of $\sigma_\alpha^2$ is inverse-gamma with shape and scale parameters
$a_{\sigma_\alpha} = 0.5$ and $b_{\sigma_\alpha} = 0.5$, respectively. We can directly sample from the conditional
distribution for $\sigma_\alpha^2$ because it is again inverse-gamma with shape parameter
$a'_{\sigma_\alpha}$ and scale parameter $b'_{\sigma_\alpha}$ given by

\[
a'_{\sigma_\alpha} = a_{\sigma_\alpha} + \frac{N_t}{2}, \qquad
b'_{\sigma_\alpha} = b_{\sigma_\alpha} + \frac{1}{2} \sum_x \left( \alpha^x - \alpha \right)^2.
\]

The prior distribution of $\log(a_\tau)$ is normal with mean $\mu_a = 0$ and standard
deviation $\sigma_a = 10$. The conditional distribution of $a_\tau$ is given by

\[
P(a_\tau \mid S, M_T, \Theta_{-a_\tau}) \propto P(a_\tau) \prod_x P\left( (\tau^x)^2 \mid a_\tau, b_\tau \right).
\]

To sample from this conditional distribution, we use a random walk Metropolis-Hastings
step. That is, if we define the $i$th sample of $a_\tau$ as $a_{\tau,i}$, the proposal for the
$(i+1)$th sample is drawn from a normal distribution with mean $a_{\tau,i}$ and standard
deviation 0.2, where 0.2 is chosen to balance the acceptance rate with step size.
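
A generic random walk Metropolis-Hastings update of this form can be sketched in Python as follows. The `log_target` argument is a stand-in for the log of the conditional density being sampled (known up to an additive constant) and is an assumption of this sketch, as are the function and argument names. The default step of 0.2 mirrors the proposal standard deviation quoted above.

```python
import math
import numpy as np

def random_walk_mh(log_target, x0, n_steps, step_sd=0.2, rng=None):
    """Random walk Metropolis-Hastings for a scalar parameter.

    log_target: function returning the log density up to a constant
    (return -inf outside the support).
    Returns (samples, acceptance_rate).
    """
    rng = np.random.default_rng() if rng is None else rng
    x, lp = x0, log_target(x0)
    samples = np.empty(n_steps)
    accepted = 0
    for i in range(n_steps):
        prop = x + rng.normal(0.0, step_sd)  # symmetric normal proposal
        lp_prop = log_target(prop)
        log_ratio = lp_prop - lp
        if log_ratio >= 0 or rng.uniform() < math.exp(log_ratio):
            x, lp = prop, lp_prop
            accepted += 1
        samples[i] = x
    return samples, accepted / n_steps
```

A larger step explores the distribution faster but is rejected more often, which is the trade-off between acceptance rate and step size referred to above.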

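The prior distribution of $b_\tau$ is gamma with shape parameter $k_b = 1$ and scale
parameter $\theta_b = 500$. We can sample directly from the conditional distribution of
$b_\tau$ because it is gamma with shape parameter $k'_b$ and scale parameter $\theta'_b$ given by

\[
k'_b = k_b + N_t\, a_\tau, \qquad
\theta'_b = \left( \theta_b^{-1} + \sum_x (\tau^x)^{-2} \right)^{-1}.
\]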

Parameters $\alpha^x, \tau^x$. The conditional distribution of $\alpha^x$ depends upon whether the
root tweet is in the training or prediction set. For training tweets, the conditional
distribution of $\alpha^x$ is normal with mean $\mu_{\alpha^x}$ and variance $\sigma^2_{\alpha^x}$ with

\[
\mu_{\alpha^x} = \left( M^x + (\tau^x)^2 \sigma_\alpha^{-2} \right)^{-1}
\left( \sum_{j=1}^{M^x} \log(S_j^x) + (\tau^x)^2 \sigma_\alpha^{-2}\, \alpha \right), \qquad
\sigma^2_{\alpha^x} = \left( M^x + (\tau^x)^2 \sigma_\alpha^{-2} \right)^{-1} (\tau^x)^2.
\]

For a prediction tweet with $n$ observed retweets, the conditional distribution of $\alpha^x$
is given by

\[
P(\alpha^x \mid S, M_T, \Theta_{-\alpha^x}) \propto
\exp\left( -\frac{(\alpha^x - \alpha)^2}{2 \sigma_\alpha^2}
- \sum_{j=1}^{n} \frac{\left( \log(S_j^x) - \alpha^x \right)^2}{2 (\tau^x)^2} \right)
\prod_{j=0}^{n} \left( 1 - F(\log(t - S_j^x) \mid \alpha^x, \tau^x) \right)^{M_j^x - m_j^x}.
\]

To sample from this conditional distribution, we use a random walk Metropolis-Hastings
step. If we define the $i$th sample of $\alpha^x$ as $\alpha^x_i$, the proposal for the $(i+1)$th
sample is drawn from a normal distribution with mean $\alpha^x_i$ and standard deviation
0.2, where 0.2 is chosen to balance the acceptance rate with step size.

The prior distribution of $(\tau^x)^2$ is inverse-gamma with shape and scale parameters
$a_\tau$ and $b_\tau$, respectively. We denote the inverse-gamma density function by
$\mathrm{IG}(\cdot \mid a_\tau, b_\tau)$. The conditional distribution of $(\tau^x)^2$ can be written as

\[
P\left( (\tau^x)^2 \mid S, M_T, \Theta_{-\tau^x} \right) \propto
\mathrm{IG}\left( (\tau^x)^2 \mid a'_\tau, b'_\tau \right)
\prod_{j=0}^{m^x(t)} \left( 1 - F(\log(t - S_j^x) \mid \alpha^x, \tau^x) \right)^{M_j^x - m_j^x},
\]

where the parameters of the inverse-gamma density function above are

\[
a'_\tau = a_\tau + \frac{m^x(t)}{2}, \qquad
b'_\tau = b_\tau + \frac{1}{2} \sum_{j=1}^{m^x(t)} \left( \log(S_j^x) - \alpha^x \right)^2.
\]

For training tweets, $M^x = m^x$ (so every exponent $M_j^x - m_j^x$ is zero), the conditional
distribution is inverse-gamma, and we can sample $(\tau^x)^2$ directly. For prediction tweets,
we must use a Metropolis-Hastings step with the proposal value for $(\tau^x)^2$ drawn from an
inverse-gamma distribution with shape and scale parameters $a'_\tau$ and $b'_\tau$, respectively.